Distributed Principal Subspace Analysis for Partitioned Big Data: Algorithms, Analysis, and Implementation
Abstract
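Principal Subspace Analysis (PSA), and its sibling, Principal Component Analysis (PCA), is one of the most popular approaches for dimensionality reduction in signal processing and machine learning. But centralized PSA/PCA solutions are fast becoming irrelevant in the modern era of big data, in which the number of samples and/or the dimensionality of samples often exceed the storage and/or computational capabilities of individual machines. This has led to the study of distributed PSA/PCA solutions, in which the data are partitioned across multiple machines and an estimate of the principal subspace is obtained through collaboration among the machines. It is in this vein that this paper revisits the problem of distributed PSA/PCA under the general framework of an arbitrarily connected network of machines that lacks a central server. The main contributions of the paper in this regard are threefold. First, two algorithms are proposed that can be used for distributed PSA/PCA, with one in the case of data partitioned across samples and the other in the case of data partitioned across (raw) features. Second, in the case of sample-wise partitioned data, the proposed algorithm and a variant of it are analyzed, and their convergence to the true subspace at linear rates is established. Third, extensive experiments on both synthetic and real-world data are carried out to validate the usefulness of the proposed algorithms. In particular, in the case of sample-wise partitioned data, an MPI-based distributed implementation is carried out to study the interplay between network topology and communications cost as well as to study the effects of straggler machines on the proposed algorithms.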
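The sample-wise setting described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not specify. It is a generic orthogonal iteration in which each node multiplies its current estimate by its local covariance and then averages the result with its neighbors over a server-less network. The function names (`ring_weights`, `decentralized_orthogonal_iteration`, `subspace_error`), the ring topology, and the iteration counts are all assumptions made for illustration, and the network is simulated in a single process with NumPy rather than with MPI.

```python
import numpy as np

def ring_weights(n_nodes):
    """Doubly stochastic mixing matrix for a ring (assumed topology):
    each node averages its own value with those of its two neighbors."""
    W = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        for j in (i, (i - 1) % n_nodes, (i + 1) % n_nodes):
            W[i, j] += 1.0 / 3.0
    return W

def decentralized_orthogonal_iteration(local_data, W, k,
                                       outer_iters=50, consensus_rounds=30, seed=0):
    """Estimate a k-dimensional principal subspace at every node.

    local_data: list of (d, n_i) arrays, one block of samples per node.
    W: doubly stochastic mixing matrix that matches the network graph.
    Returns a list with one (d, k) orthonormal basis per node.
    """
    n_nodes = len(local_data)
    d = local_data[0].shape[0]
    rng = np.random.default_rng(seed)
    Q0, _ = np.linalg.qr(rng.standard_normal((d, k)))
    Q = [Q0.copy() for _ in range(n_nodes)]      # common starting point
    covs = [X @ X.T for X in local_data]         # local (unnormalized) covariances
    for _ in range(outer_iters):
        # Local step: each node applies its own covariance to its estimate.
        Z = np.stack([C @ Qi for C, Qi in zip(covs, Q)])   # shape (n_nodes, d, k)
        # Consensus step: repeated neighbor averaging approximates the
        # network-wide mean of the local products without a central server.
        for _ in range(consensus_rounds):
            Z = np.einsum('ij,jdk->idk', W, Z)
        # Re-orthonormalize at every node.
        Q = [np.linalg.qr(Zi)[0] for Zi in Z]
    return Q

def subspace_error(Q, U):
    """Spectral-norm distance between the projectors onto span(Q) and span(U)."""
    return np.linalg.norm(Q @ Q.T - U @ U.T, 2)
```

In this sketch the number of consensus rounds trades communication cost against how closely each node's estimate tracks the centralized orthogonal iteration, which is one simple way to see the interplay between topology and communication that the abstract mentions.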
Similar resources
A covariance-free iterative algorithm for distributed principal component analysis on vertically partitioned data
In this paper, a covariance-free iterative algorithm is developed to achieve distributed principal component analysis on high-dimensional data sets that are vertically partitioned. We have proved that our iterative algorithm converges monotonically at an exponential rate. Different from existing techniques that aim at approximating the global PCA, our covariance-free iterative distributed PCA ...
Principal and Minor Subspace Tracking: Algorithms & Stability Analysis
We consider the problem of tracking the minor or principal subspace of a positive Hermitian covariance matrix. We first propose a fast and numerically robust implementation of the Oja algorithm (FOOja: Fast Orthogonal Oja). The latter is said to be fast in the sense that its computational cost is of order O(np) flops per iteration, where n is the size of the observation vector and p < n is the number of m...
MapReduce Algorithms for Big Data Analysis
There is a growing trend of applications that should handle big data. However, analyzing big data is a very challenging problem today. For such applications, the MapReduce framework has recently attracted a lot of attention. Google’s MapReduce or its open-source equivalent Hadoop is a powerful tool for building such applications. In this tutorial, we will introduce the MapReduce framework based...
Communication-efficient Algorithms for Distributed Stochastic Principal Component Analysis
We study the fundamental problem of Principal Component Analysis in a statistical distributed setting in which each machine out of m stores a sample of n points sampled i.i.d. from a single unknown distribution. We study algorithms for estimating the leading principal component of the population covariance matrix that are both communication-efficient and achieve estimation error of the order of...
Principal Component Analysis and Higher Correlations for Distributed Data
We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily on many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We present algorithms for two illustrative problems on massive data sets: (1) computing a low-rank approximation of a matrix A = A^1 + A^2 + ... + A^s, with matrix A^t stored...
Journal
Journal title: IEEE Transactions on Signal and Information Processing over Networks
Year: 2021
ISSN: 2373-776X, 2373-7778
DOI: https://doi.org/10.1109/tsipn.2021.3122297